ABSTRACT


Central  to  the  development  of  adaptive  pattern  processing  algorithms  (adaptive 
filters)  for  random  problems  —  problems  where  statistics  are  unknown  a  priori  and/or 
explicit  rules  governing  behavior  cannot  be  extracted  in  a  reductionist  manner  —  is  the 
pursuit  of  adaptive  architectures  for  associating  arbitrary  inputs  to  outputs.  Such 
"associative  memories"  are  important  for  providing  the  mathematical  mapping  (transfer 
function)  relating  inputs  to  outputs  arising  from  implicit  relationships  found  in  a  given 
training  ensemble.  The  adaption  of  these  filters  or  architectures  during  training  is  guided 
by  a  learning  algorithm,  mathematically  derived  from  an  objective  function  to  ensure  good 
association  properties. 

The subject of this paper is an investigation of a class of learning algorithms for the
highly parallel multilayer perceptron architecture used in an associative memory context.
By controlling the scheduling of patterns presented during training, a generalized class of
learning algorithms is shown to result. Specific realizations of the generalized algorithm
include steepest descent (parameters adapted following presentation of all training
patterns), Rumelhart back propagation (parameters adapted following presentation of each
pattern), and a new algorithm which captures in part the benefits of both, fewer parameter
adaptions and faster convergence, by gradually varying the number of patterns presented per
parameter adaption. A systematic derivation of the fundamental steepest descent algorithm
for the multilayer perceptron is included for clarification, although related learning
algorithms have been formulated by others.

Learning  results  are  presented  utilizing  the  algorithms  on  a  simple  benchmark 
association  problem.  Although  performance  is  similar  amongst  the  algorithms,  the  relative 
computational  burden  differs  substantially.  Of  the  algorithms  investigated,  steepest  descent 
requires  the  least  computation,  while  back  propagation  is  the  most  demanding. 
1. INTRODUCTION


1.1  OBJECTIVE 

The  objective  of  a  learning  algorithm  is  to  adjust  the  free  parameters  of  an  adaptive  system 
architecture  to  achieve  satisfactory  system  performance  in  a  given  problem  domain.  Often  the 
algorithm  can  be  derived  from  an  objective  or  risk  function,  which  mathematically  incorporates  the 
desired objectives of the problem solution. Minimizing the objective function through
execution of the resulting learning algorithm then yields a set of free parameters defining the
optimal system. The specific embodiment presented entails deriving a learning algorithm for an
adaptive  system  used  for  associating  arbitrary  input  and  output  vector  pairs.  This  arbitrary 
association  problem  —  associative  memory  —  is  quite  useful  in  the  classification  of  patterns 
required  for  many  pattern  processing  applications  including  speech  recognition,  machine  vision, 
and  decision  making. 

Assume an arbitrary set of input/output vector pairs comprising the training set T is given
by

$$ T = \{ (\vec{x}_1, \vec{y}_1), (\vec{x}_2, \vec{y}_2), \ldots, (\vec{x}_M, \vec{y}_M) \}, \tag{1} $$

the  specific  objective  being  to  derive  a  learning  algorithm  which  best  associates  the  input/output 
pairs  found  in  the  training  set.  To  rigorously  derive  the  best  algorithm,  the  system  architecture  and 
objective  function  must  first  be  defined.  Desired  objectives  include  perfect  recall  and 
generalization.  Mathematically,  these  objectives  can  be  satisfied  by  requiring  the  learning 
algorithm  to  adapt  the  free  parameters  of  the  system  or  machine  towards  a  realization  that  results  in 
a  mapping  function  F  (see  Fig.  1.)  such  that 

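$$ F(\vec{x}_i) = \vec{y}_i \quad \forall\, i \qquad \text{(perfect recall)} \tag{2} $$

$$ F(\vec{x}_i + \vec{n}) = \vec{y}_i \quad \forall\, i \qquad \text{(generalization).} \tag{3} $$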

Note,  caution  must  be  exercised  when  interpreting  generalization  capability.  First, 
generalization is used here in a strict mathematical sense. That is, if the system is excited by a
novel input close in norm to the masked prototype ($\vec{x} = \vec{x}_i + \vec{n}$, where
$\vec{x}_i \in T$ and $\|\vec{n}\| < \delta$) and a correct output is obtained, then the system is said to possess
generalization  capability.  Good  generalization  is  achieved  by  maintaining  large  distances  in  the 
input  space  between  the  inputs  having  different  outputs,  and  operating  well  within  the  memory 
capacity [1].

Fig. 1. (a) Machine notation with inputs, outputs and target (desired) outputs,
(b) Mathematical associative memory mapping.

Confusion often arises when a semantic interpretation of generalization is applied to the
mathematical  system.  Basically,  semantic  generalization  is  the  ability  to  generalize  semantic 
concepts.  As  an  example,  consider  teaching  a  person  to  identify  trees.  The  training  might  consist 
of  presenting  the  student  several  species  of  trees  with  the  respective  correct  labels  (i.e.,  birch,  pine, 
oak).  Now  if  the  student  is  then  presented  a  novel  tree,  say  an  apple  tree,  and  correctly  recognizes 
it  as  a  tree,  then  the  student  has  demonstrated  semantic  generalization.  And  indeed  this  type  of 
generalization,  giving  rise  to  robust  recognition,  is  intensely  sought  by  machine  intelligence 
researchers. 

The  distinction  between  mathematical  and  semantic  generalization  lies  in  the  representation  of 
the  data.  For  if  a  representation  is  chosen  for  the  tree  learning  problem  whereby  the  apple  tree 
input  vector  is  within  the  region  of  the  input  space  mapped  to  the  output  region  designating  "tree," 
mathematical  generalization  yields  the  desired  semantic  generalization.  Therefore,  the 
representation  of  the  data  determines  whether  semantic  generalization  is  achieved  in  a  system 
possessing  mathematical  generalization  capabilities. 

1.2  APPROACH 

With  the  desired  objectives  of  the  system  delineated,  the  specific  system  architecture  and 
objective  function  must  be  defined,  thereby  restricting  the  class  of  solutions  for  the  association 
problem. 

The system architecture chosen is the multilayer perceptron [2,3]. Illustrated in Fig. 2, the
architecture consists of processing units and interconnections. Each interconnection has an
associated connection strength or weight. The complete collection of weights comprises the set of
free parameters for the adaptive system. Each processing unit first performs a weighted
accumulation of the respective inputs and bias value, then passes the result through a threshold
function.
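The processing unit just described can be sketched in a few lines. This is only an illustration: the logistic sigmoid is an assumed choice of threshold function (the text requires only a nonlinear, differentiable one), and the names `sigmoid`, `unit_output` and `layer_output` are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic threshold function (an assumed choice of f)."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(inputs, weights, bias):
    """Weighted accumulation of the inputs plus bias, passed through the threshold."""
    net = sum(w * z for w, z in zip(weights, inputs)) + bias
    return sigmoid(net)

def layer_output(inputs, weight_rows, biases):
    """Outputs of all units on one layer; weight_rows[j][i] connects input i to unit j."""
    return [unit_output(inputs, row, b) for row, b in zip(weight_rows, biases)]
```

Because every unit on a layer depends only on the previous layer's outputs, the list comprehension in `layer_output` could be evaluated in parallel, which is the property the architecture exploits.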

Reasons supporting the architecture choice include fast system operation, achieved by the
highly parallel circuitry. Furthermore, the multilayer architecture is quite expressive. That is, most
mapping functions F can be realized with the architecture shown. In fact, Kolmogorov proved
that continuous mapping functions of several inputs can be expressed by a three layer network [4].
The theorem and its extensions [5,6] are existence proofs, however, and hence offer no practical
learning algorithm to direct the selection of the connection weights. Nevertheless, with the
processing units shown in Fig. 2b, together with connection strengths given by the learning
algorithms to be discussed, the multilayer perceptron architecture has been empirically found (via
computer simulation) capable of mapping complex association problems [7-10].


Fig. 2. (a) Multilayer perceptron architecture and (b) processing unit 2 on layer 2 ($N_k$ = number of
processing units on layer k).

Next an objective (equivalently risk, energy, performance) function must be specified which
incorporates  the  desired  objectives  in  a  mathematical  expression.  Again  for  the  association 
problem,  the  desire  is  to  construct  a  system  whose  output  best  approximates  the  behavior  found  in 
the  training  data,  resulting  in  perfect  recall  and  generalization.  The  objective  function  utilized  is  the 
sum  squared  error  over  all  training  patterns 

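$$ E = \sum_{\text{patterns}\ p} E_p \tag{4} $$

where

$$ E_p = \frac{1}{2} \sum_{l} \left( y_{pl} - z^K_{pl} \right)^2 \tag{5} $$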

represents the error per pattern as the squared difference of the machine and desired outputs when
an input pattern $\vec{x}_p$ is presented. Consequently the approach involves deriving a learning
algorithm that selects the free parameters (weights $w^k_{ji}$) to minimize the overall sum squared error
(4), thereby forcing the outputs of the system to mimic the outputs displayed in the training set.
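As a minimal sketch of the objective function, the per-pattern error and the total error can be computed as follows. The factor 1/2 is an assumption, chosen to agree with the RMS expression sqrt(2E/ML) used in the results; the function names are hypothetical.

```python
def pattern_error(targets, outputs):
    """E_p: half the summed squared difference between target and machine outputs."""
    return 0.5 * sum((y - z) ** 2 for y, z in zip(targets, outputs))

def total_error(all_targets, all_outputs):
    """E: sum of the per-pattern errors over the whole training set."""
    return sum(pattern_error(y, z) for y, z in zip(all_targets, all_outputs))
```

Perfect recall corresponds to `total_error` returning zero on the training set.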

Exploiting  the  general  architecture  and  objective  function,  a  number  of  learning  algorithms 
have  been  developed.  First,  Widrow  and  Hoff  [11]  developed  an  adaptive  network  wherein  the 
threshold  function  f  (x  )  is  linear.  The  resulting  LMS  (least  mean  square)  algorithm  and  related 
extensions  have  proven  successful  in  numerous  signal  processing  applications  requiring  adaptive 
filters  [12].  Parker  [13]  has  developed  a  least  squares  solution  which  simultaneously  minimizes 
the  changes  to  internal  architecture  parameters  during  training.  Thus  the  algorithm,  by  the  method 
of  Lagrange  multipliers,  actually  represents  a  solution  to  a  constrained  minimization  problem. 

The current approach involves first presenting a systematic derivation of the steepest descent
learning algorithm for the described architecture with an arbitrary nonlinear yet differentiable
threshold function f(x), and accompanying squared error objective function. Next, modifications of
the resulting algorithm are shown to yield the Rumelhart back propagation algorithm [7] and a new
algorithm which captures in part the benefits of both steepest descent (fewer weight adaptions) and
back propagation (faster convergence). Results are then presented comparing the performance of
the algorithms on a benchmark association problem. Finally, the relative computational burden is
assessed and the sensitivity of the proposed algorithm to additional a priori information is
demonstrated.
2. LEARNING ALGORITHMS


A  systematic  derivation  of  the  steepest  descent  learning  algorithm  for  the  multilayer 
perceptron  architecture  with  accompanying  sum  squared  error  objective  function  is  presented. 
Following,  a  generalized  learning  algorithm  is  introduced  that  reduces  to  a  host  of  special  cases, 
including  steepest  descent  and  Rumelhart  back  propagation,  by  simply  controlling  the  manner  in 
which  the  training  patterns  are  presented. 


2.1  DERIVATION 

The minimization of the objective function with respect to the connection strengths (weights $w^k_{ji}$)
is conducted numerically by a steepest descent algorithm (see Appendix for basic formulation),
with the weight adaption given by

$$ w^k_{ji}(n+1) = w^k_{ji}(n) + \Delta w^k_{ji}(n) \tag{6} $$

where

$$ \Delta w^k_{ji}(n) = -\eta \left. \frac{\partial E}{\partial w^k_{ji}} \right|_{\vec{w}(n)} \tag{7} $$

The iterative process is initiated with $\vec{w}(0)$, a pseudorandom vector. The algorithm (6,7) simply
forms a new weight value by stepping from the present position in the direction of steepest
descent, wherein the size of the step is governed by the learning rate $\eta$.

Clearly the learning algorithm (6,7) is completely specified once the gradient of the error
measure is derived for the multilayer system architecture. Proceeding, computing the gradient

$$ \frac{\partial E}{\partial w^k_{ji}} = \sum_p \frac{\partial E_p}{\partial w^k_{ji}} \tag{8} $$

requires computing the partial derivative of the error per pattern with respect to the weights, thus
by (5)

$$ \frac{\partial E_p}{\partial w^k_{ji}} = -\sum_l \left( y_{pl} - z^K_{pl} \right) \frac{\partial z^K_{pl}}{\partial w^k_{ji}} \tag{9} $$


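which in turn necessitates calculating the partial derivatives of the output units (layer K) with
respect to the weights. Recall from the network architecture, the expression for the l-th output
unit is

$$ z^K_{pl} = f^K_l\!\left( \mathrm{net}^K_{pl} \right) \tag{10} $$

where

$$ \mathrm{net}^K_{pl} = \sum_m w^K_{lm}\, z^{K-1}_{pm} \tag{11} $$

and

$$ z^{K-1}_{pm} = f^{K-1}_m\!\left( \mathrm{net}^{K-1}_{pm} \right) \tag{12} $$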

Notice from the architecture (Fig. 2), $\mathrm{net}^K_{pl}$ represents the net input to unit l of layer K when
excited with input pattern $\vec{x}_p$, while $f^K_l$ is a differentiable threshold function for unit l of layer
K. Consequently, using the chain rule on (10)

$$ \frac{\partial z^K_{pl}}{\partial w^k_{ji}} = {f^K_l}'\!\left( \mathrm{net}^K_{pl} \right) \frac{\partial\, \mathrm{net}^K_{pl}}{\partial w^k_{ji}} \tag{13} $$

Notice the partial derivative of the net input to the output unit must now be computed with respect
to the weight $w^k_{ji}$. Depending upon where the specific weight $w^k_{ji}$ occurs in the network
architecture, two algorithms result.
2.1.1. Case 1. Hidden → Output


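Consider the weights connecting units from the K − 1 layer (last hidden unit layer) to the
K-th layer (output layer). Here the layer index on such weights becomes k = K, yielding

$$ \frac{\partial\, \mathrm{net}^K_{pl}}{\partial w^K_{ji}} = \sum_m z^{K-1}_{pm} \frac{\partial w^K_{lm}}{\partial w^K_{ji}} = \sum_m z^{K-1}_{pm}\, \delta_{l-j}\, \delta_{m-i} = z^{K-1}_{pi}\, \delta_{l-j} \tag{14} $$

where

$$ \delta_n = \begin{cases} 1 & n = 0 \\ 0 & n \ne 0 \end{cases} \tag{15} $$

is the Kronecker delta function. Hence, the steepest descent learning rule for the last layer of
weights is given by (6) and (7) with

$$ \begin{aligned} \frac{\partial E}{\partial w^K_{ji}} &= -\sum_p \sum_l \left( y_{pl} - z^K_{pl} \right) {f^K_l}'\!\left( \mathrm{net}^K_{pl} \right) z^{K-1}_{pi}\, \delta_{l-j} \\ &= -\sum_p \left( y_{pj} - z^K_{pj} \right) {f^K_j}'\!\left( \mathrm{net}^K_{pj} \right) z^{K-1}_{pi} \\ &= -\sum_p \Delta^K_{pj}\, z^{K-1}_{pi} \end{aligned} \tag{16} $$

where

$$ \Delta^K_{pj} = \left( y_{pj} - z^K_{pj} \right) {f^K_j}'\!\left( \mathrm{net}^K_{pj} \right) \tag{17} $$

represents the error signal derived from the j-th output unit upon the presentation of pattern $\vec{x}_p$.
Consequently, the weights $w^K_{ji}$ in the last layer are adapted following the presentation of all
training patterns according to

$$ \Delta w^K_{ji}(n) = \eta \sum_p \Delta^K_{pj}\, z^{K-1}_{pi} \tag{18} $$


2.1.2. Case 2. Hidden → Hidden and Input → Hidden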


Next consider the weights connecting units within the hidden unit layers. These weights are
changed sequentially in layers, beginning with the weights between hidden units in layers K − 2
and K − 1, proceeding backwards until reaching the first hidden units in the layer coupled to the
inputs. Beginning with the weights between hidden unit layers K − 2 and K − 1, the partial
derivative of the error per pattern with respect to the weight $w^{K-1}_{ji}$, following (9) through (13), is

$$ \frac{\partial E_p}{\partial w^{K-1}_{ji}} = -\sum_l \left( y_{pl} - z^K_{pl} \right) {f^K_l}'\!\left( \mathrm{net}^K_{pl} \right) \frac{\partial\, \mathrm{net}^K_{pl}}{\partial w^{K-1}_{ji}} \tag{19} $$

where now

$$ \frac{\partial\, \mathrm{net}^K_{pl}}{\partial w^{K-1}_{ji}} = \sum_m w^K_{lm} \frac{\partial z^{K-1}_{pm}}{\partial w^{K-1}_{ji}} \tag{20} $$

Substituting the expression for the m-th unit of layer K − 1, the resulting term necessary for
evaluating (20) becomes

$$ \frac{\partial z^{K-1}_{pm}}{\partial w^{K-1}_{ji}} = {f^{K-1}_m}'\!\left( \mathrm{net}^{K-1}_{pm} \right) \frac{\partial\, \mathrm{net}^{K-1}_{pm}}{\partial w^{K-1}_{ji}} = {f^{K-1}_m}'\!\left( \mathrm{net}^{K-1}_{pm} \right) z^{K-2}_{pi}\, \delta_{m-j} \tag{21} $$

Finally, substituting (21), (20), and (19) into (8) yields the gradient

$$ \begin{aligned} \frac{\partial E}{\partial w^{K-1}_{ji}} &= -\sum_p \sum_l \left( y_{pl} - z^K_{pl} \right) {f^K_l}'\!\left( \mathrm{net}^K_{pl} \right) \sum_m w^K_{lm}\, {f^{K-1}_m}'\!\left( \mathrm{net}^{K-1}_{pm} \right) z^{K-2}_{pi}\, \delta_{m-j} \\ &= -\sum_p \left( \sum_l \Delta^K_{pl}\, w^K_{lj} \right) {f^{K-1}_j}'\!\left( \mathrm{net}^{K-1}_{pj} \right) z^{K-2}_{pi} \\ &= -\sum_p \Delta^{K-1}_{pj}\, z^{K-2}_{pi} \end{aligned} \tag{22} $$

where

$$ \Delta^{K-1}_{pj} = {f^{K-1}_j}'\!\left( \mathrm{net}^{K-1}_{pj} \right) \sum_l \Delta^K_{pl}\, w^K_{lj} \tag{23} $$

and thus the error signal $\Delta^{K-1}_{pj}$ driving the adaption of the weights in the K − 1 layer is
expressed recursively as a weighted sum of the error signals $\Delta^K_{pl}$ computed in the previous layer.
This recursion (23) allows the adaption of the weights hidden between layers of hidden units
which do not have explicit target values, as do the output units.

Consequently, the weights $w^{K-1}_{ji}$ in the K − 1 layer are adapted following the
presentation of all training patterns according to

$$ \Delta w^{K-1}_{ji}(n) = \eta \sum_p \Delta^{K-1}_{pj}\, z^{K-2}_{pi} \tag{24} $$

This procedure is repeated layer by layer until the first layer is reached, upon which the weights are
adapted by

$$ \Delta w^1_{ji}(n) = \eta \sum_p \Delta^1_{pj}\, z^0_{pi} \tag{25} $$

where

$$ z^0_{pi} = x_{pi}, \qquad i = 1, 2, \ldots, N_0 \tag{26} $$

In summary, the learning procedure consists of cycling through two phases repeatedly until
appropriate performance criteria are satisfied. The initial phase, forward propagation, entails
passing an input training pattern $\vec{x}_p$ through the forward circuitry to obtain the network output
$\vec{z}^K_p$. Next, the back propagation phase is conducted by forming error signals $\Delta^K_{pj}$ between the
output $\vec{z}^K_p$ and the desired target value $\vec{y}_p$. Now in a recursive fashion demonstrated in Fig. 3,
the error signals derived from the output layer are subsequently used to form new error signals
$\Delta^{K-1}_{pj}$, which guide weight adaptions in layer (K − 1) per (24). The back propagation
recursion is completed when the first hidden unit layer is reached. Following the completion of
one training cycle (forward propagation → back propagation), a new training pattern is presented
and the cycle is repeated. Cycling through all training patterns once allows the adaption of the
weights according to (27), and is referred to as a training sweep.

The complete circuitry required for the forward and backward propagation phases is shown
in Fig. 4 for a network with two hidden units. As observed, the compromise for the fast parallel
forward circuitry (shown in bold) is the complicated back propagation feedback circuit. However,
once trained, the weights are fixed and thus only the forward circuitry is necessary for operation.
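The two phases of the derivation can be sketched for a network with a single hidden layer (K = 2). The sketch assumes a logistic threshold function, for which f'(net) = z(1 − z), and folds each bias into the weight row as a final weight driven by a constant input of 1; these choices and all function names are illustrative assumptions, not part of the derivation. `gradients` returns the accumulated products of error signal and input, which equal the negative gradient of E.

```python
import math

def f(x):
    """Logistic threshold function (assumed; any differentiable f would do)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W1, W2):
    """Forward propagation. Each weight row ends with a bias weight."""
    z0 = list(x) + [1.0]                       # inputs plus constant bias input
    z1 = [f(sum(w * z for w, z in zip(row, z0))) for row in W1] + [1.0]
    z2 = [f(sum(w * z for w, z in zip(row, z1))) for row in W2]
    return z0, z1, z2

def gradients(patterns, W1, W2):
    """Accumulate sum_p Delta_pj * z_pi for both layers (the negative gradient)."""
    G1 = [[0.0] * len(row) for row in W1]
    G2 = [[0.0] * len(row) for row in W2]
    for x, y in patterns:
        z0, z1, z2 = forward(x, W1, W2)
        # Output layer error signal: (y - z) f'(net), with f' = z (1 - z).
        d2 = [(yj - zj) * zj * (1.0 - zj) for yj, zj in zip(y, z2)]
        # Hidden layer error signal: f'(net) times weighted sum of output error signals.
        d1 = [z1[j] * (1.0 - z1[j]) * sum(d2[l] * W2[l][j] for l in range(len(W2)))
              for j in range(len(W1))]
        for j, row in enumerate(G2):
            for i in range(len(row)):
                row[i] += d2[j] * z1[i]
        for j, row in enumerate(G1):
            for i in range(len(row)):
                row[i] += d1[j] * z0[i]
    return G1, G2

def total_error(patterns, W1, W2):
    """Sum squared error E = sum_p 1/2 sum_l (y_pl - z_pl)^2."""
    return sum(0.5 * sum((yj - zj) ** 2 for yj, zj in zip(y, forward(x, W1, W2)[2]))
               for x, y in patterns)
```

A steepest descent step then adds `eta * G[j][i]` to every weight. Comparing an entry of `G1` against a central finite difference of `total_error` is a convenient way to check the recursion.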
Fig. 3. Schematic of learning procedure.

Fig. 4. Forward and backward propagation circuitry for a two hidden unit network.


2.2  GENERALIZATION  AND  SPECIAL  CASES 


To  motivate  the  generalization  of  the  learning  algorithm,  consider  the  weight  adaption 
process  as  a  numerical  search  through  a  multidimensional  weight  space.  The  trajectory  is  governed 
by  the  general  weight  adaption  rule 


$$ \Delta w^k_{ji}(n) = \eta \sum_{p=1}^{J(n)} \Delta^k_{pj}\, z^{k-1}_{pi} \tag{29} $$

and  the  objective  is  to  find  the  location  where  the  objective  function  [sum  squared  error  (4,5)]  is 
minimum. 

Now according to the steepest descent learning algorithm ($J(n) = M$), the weights are
adapted  following  the  presentation  of  all  training  patterns.  Thus  much  caution  is  taken  in  choosing 
a  trajectory,  for  the  response  of  the  system  to  the  entire  training  set  is  considered  before  each  step 
is  taken. 

Conversely, the Rumelhart back propagation algorithm [7] results in a more erratic search.
Here the weights are adapted following the presentation of a single pattern in the training set
($J(n) = 1$). Intuitively, such an erratic trajectory is desired for initially searching a vast
multidimensional space. However, the noisy trajectory can be prohibitive in the latter phases of the
search, causing the new step to wander away from the near optimal region. In fact, such trajectory
jitter is often countered with smoothing by placing a momentum term in the learning equation [7]
(actually a single pole low pass filter, $\Delta w(n+1) = \alpha\, \Delta w(n) + \eta\, (\Delta z)$).
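The low pass behavior of the momentum term can be seen in a short sketch. The default values of `eta` and `alpha` follow the network parameters quoted for the XOR experiment; the function name and the scalar treatment of a single weight are illustrative assumptions.

```python
def momentum_step(prev_delta, gradient_term, eta=0.25, alpha=0.90):
    """dw(n+1) = alpha * dw(n) + eta * (Delta * z): a single pole low pass filter."""
    return alpha * prev_delta + eta * gradient_term

def smooth(gradient_terms, eta=0.25, alpha=0.90):
    """Filter a sequence of per-pattern terms (Delta * z) for one weight."""
    delta, out = 0.0, []
    for g in gradient_terms:
        delta = momentum_step(delta, g, eta, alpha)
        out.append(delta)
    return out
```

For a constant input g the filtered step settles at eta * g / (1 − alpha), while an input that alternates in sign is strongly attenuated, which is the smoothing effect referred to above.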

A  compromise  is  postulated  to  gain  the  benefit  of  both  extremes.  At  the  beginning  of  the 
search,  the  weights  are  adapted  after  each  training  pattern  is  presented,  while  toward  the  end  of  the 
search,  all  training  patterns  are  presented  before  the  weights  are  adapted.  The  variation  is  gradual 
as  depicted  in  Fig.  5. 
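A minimal sketch of a trainer built around the general weight adaption rule follows, in which a caller-supplied function `schedule(n)` gives the number of patterns presented before the n-th weight adaption. It assumes one hidden layer, a logistic threshold function, patterns presented in a fixed cyclic order, and initial weights drawn uniformly from [−0.5, 0.5]; none of these details, nor the names used, come from the text.

```python
import math
import random

def f(x):
    """Logistic threshold function (assumed)."""
    return 1.0 / (1.0 + math.exp(-x))

def train(patterns, hidden, schedule, cycles, eta=0.25, seed=0):
    """Run `cycles` training cycles; adapt the weights after schedule(n) patterns."""
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)] for _ in range(n_out)]
    A1 = [[0.0] * (n_in + 1) for _ in range(hidden)]    # accumulated Delta * z
    A2 = [[0.0] * (hidden + 1) for _ in range(n_out)]
    n = presented = 0
    for c in range(cycles):
        x, y = patterns[c % len(patterns)]
        # Forward propagation (last entry of z0 and z1 is the constant bias input).
        z0 = list(x) + [1.0]
        z1 = [f(sum(w * z for w, z in zip(r, z0))) for r in W1] + [1.0]
        z2 = [f(sum(w * z for w, z in zip(r, z1))) for r in W2]
        # Back propagation of the error signals.
        d2 = [(yj - zj) * zj * (1 - zj) for yj, zj in zip(y, z2)]
        d1 = [z1[j] * (1 - z1[j]) * sum(d2[l] * W2[l][j] for l in range(n_out))
              for j in range(hidden)]
        for A, d, z in ((A2, d2, z1), (A1, d1, z0)):
            for j, row in enumerate(A):
                for i in range(len(row)):
                    row[i] += d[j] * z[i]
        presented += 1
        if presented >= schedule(n):
            # Weight adaption: apply and clear the accumulated sums.
            for W, A in ((W2, A2), (W1, A1)):
                for j, row in enumerate(W):
                    for i in range(len(row)):
                        row[i] += eta * A[j][i]
                        A[j][i] = 0.0
            n += 1
            presented = 0
    return W1, W2, n
```

Passing `lambda n: 1` gives adaption after every pattern, `lambda n: M` gives adaption after every sweep of M patterns, and any function that grows from 1 to M over the course of training gives a graded schedule of the kind postulated here.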
Fig. 5. Graded training schedule (axis: training cycles).


Such a schedule requires $QM^2$ training cycles (M durations at QM cycles per duration),
while the total number of weight adaptions follows a finite harmonic series

$$ QM \sum_{u=1}^{M} \frac{1}{u}. $$

Q is chosen large enough to ensure satisfactory training. Typically Q is estimated by

$$ Q = s / M $$

where s is the average number of training sweeps required to learn a given mapping using the
Rumelhart algorithm [7].

In consequence, three distinct learning algorithms emerge as special cases of the
generalized weight adaption rule (29). These cases are distinguished by the manner in which the
training patterns are presented per weight adaption according to

$$ J(n) = \begin{cases} M & \text{steepest descent} \\ 1 & \text{Rumelhart} \\ \displaystyle \sum_{q=1}^{M} U\!\left( n - QM \sum_{r=1}^{q-1} \frac{1}{r} \right) & \text{harmonic} \end{cases} \tag{32} $$

where

$$ U(n) = \begin{cases} 1 & 0 \le n \\ 0 & n < 0 \end{cases} $$

is the unit step function.
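The three presentation schedules can be sketched as a single function of the adaption index n. The harmonic branch assumes that duration q begins once QM(1 + 1/2 + ... + 1/(q − 1)) adaptions have been made, with n counted from zero; exact rational arithmetic is used so that the duration boundaries are not disturbed by rounding. The function and argument names are hypothetical.

```python
from fractions import Fraction

def unit_step(n):
    """U(n) = 1 for n >= 0, otherwise 0."""
    return 1 if n >= 0 else 0

def patterns_per_adaption(n, M, Q, schedule="harmonic"):
    """J(n): number of training patterns presented before the n-th weight adaption."""
    if schedule == "steepest descent":
        return M
    if schedule == "rumelhart":
        return 1
    # Harmonic: add one pattern per adaption each time a new duration begins.
    total = 0
    for q in range(1, M + 1):
        start = Q * M * sum(Fraction(1, r) for r in range(1, q))
        total += unit_step(n - start)
    return total
```

With M = 4 and an illustrative Q = 3, the durations contain 12, 6, 4 and 3 adaptions at 1, 2, 3 and 4 patterns per adaption, so each duration spans QM = 12 training cycles and the whole schedule spans 48 cycles. The durations divide evenly only when QM is a multiple of every u up to M.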


3.  RESULTS 


The performance comparison via simulation consists of exercising the algorithms on a
benchmark association problem. Given the training set, the learning process is initiated with a
pseudorandom weight vector and is terminated when the outputs of the system are all within 10%
of the target values

$$ \left| y_{pj} - z^K_{pj} \right| < 0.1 \quad \forall\, p, j \tag{33} $$

denoting zero decision errors. In addition to the tabulation of the decision errors versus training
sweep, the root-mean-squared error ($RMS = \sqrt{2E/ML}$) is also included. Furthermore, the relative
computational burden of the algorithms is discussed, along with the sensitivity of the proposed
algorithm to the additional a priori information required.
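The termination test and the RMS measure can be sketched as below. The sketch reads the 10% criterion as an absolute tolerance of 0.1 on every output and takes L to be the number of output units; both readings, and the function names, are assumptions.

```python
import math

def decision_errors(targets, outputs, tolerance=0.1):
    """Count outputs that miss their target by `tolerance` or more."""
    return sum(1 for y, z in zip(targets, outputs)
               for yj, zj in zip(y, z) if abs(yj - zj) >= tolerance)

def rms_error(targets, outputs):
    """RMS = sqrt(2E / (M L)) for M patterns with L outputs each."""
    M, L = len(targets), len(targets[0])
    E = sum(0.5 * (yj - zj) ** 2 for y, z in zip(targets, outputs)
            for yj, zj in zip(y, z))
    return math.sqrt(2.0 * E / (M * L))
```

Training would stop at the first sweep for which `decision_errors` returns zero.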

3.1  BENCHMARK  ASSOCIATION  PROBLEM 


The  classical  benchmark  problem  for  analyzing  perceptron  based  architectures  is  the 
exclusive  -  OR  (XOR)  problem  [3].  The  XOR  mapping 


T  = 


(34) 


involves second-order correlations and hence requires hidden units. The XOR performance
comparison is conducted by executing each of the three algorithms supplied with identical
pseudorandom weight vectors. Both RMS and decision errors are tabulated as a function of the
training sweep. (A training sweep is defined as M training cycles.) Following 200 independent
training sessions for each algorithm, statistics are gathered and the architecture is then modified by
changing the number of hidden units. Results of the procedure, shown in Fig. 6, include the
average time required by the algorithms to learn the XOR mapping (expressed by the number of
training sweeps) as a function of the number of hidden units. Also shown are the least-squares
linear fit expressions [14] for each of the algorithms, yielding a linear variation in the training time
versus the logarithm of the number of hidden unit processors. Such a relationship reinforces the
trade-off between training time and architecture size for each of the algorithms.

Notice  the  algorithms  exhibit  similar  performance.  The  harmonic  algorithm  experiences 
slightly  faster  convergence  for  the  larger  architectures,  while  the  Rumelhart  algorithm  is  slightly 
superior  at  the  smaller  architectures,  and  the  steepest  descent  algorithm  possesses  the  slowest 
convergence. 


Fig. 6. Averaged results for XOR problem (axis: hidden units, log base 2, from 0 to 7). Network
parameters include $\eta = 0.25$, $\alpha = 0.90$. Error bars denote $\pm\sigma$.
3.2 COMPUTATION REQUIREMENT COMPARISON


Given  an  arbitrary  association  problem,  analytically  assessing  the  absolute  computational 
requirements  for  successful  learning  using  the  learning  algorithms  discussed  would  be  difficult. 
However,  if  the  number  of  training  cycles  is  maintained  equivalent  for  each  of  the  algorithms,  a 
relative  computation  comparison  is  straightforward.  And  provided  the  algorithms  yield  similar 
performance  on  the  given  association  task  (as  demonstrated  in  the  above  XOR  problem),  such 
comparison  serves  to  accurately  depict  the  differences  in  computational  requirements. 

Now since each algorithm employs the same circuitry to conduct a training cycle (see Fig. 4),
the difference in computation lies primarily in the number of weight adaptions required. Thus the
quantity of interest is the relative number of weight adaptions required by each of the learning
algorithms, assuming an equal number of training cycles conducted.

Assuming $QM^2$ training cycles are required to learn a given mapping, then the number of
weight adaptions required for the Rumelhart algorithm is $N^{R} = QM^2$ (since the weights are
adapted following each training cycle), while the steepest descent algorithm requires $N^{SD} = QM$
(since the weights are adapted following M training cycles). Finally, the number of net weight
adaptions for the harmonic algorithm is

$$ N^{H} = QM \sum_{u=1}^{M} \frac{1}{u} \tag{35} $$

and is seen to be a fraction of those required for the Rumelhart algorithm. The growth of the weight
adaptions as a function of the number of training cycles is illustrated in Fig. 7.

Fig. 7. Cumulative weight adaptions for the learning algorithms (axis: training cycles).

Notice the harmonic algorithm asymptotically reaches the rate of adaptions required by steepest
descent, while the Rumelhart algorithm grows at a rate M times faster.

Therefore, assuming an equal number of training cycles for each algorithm, the harmonic
algorithm requires significantly fewer weight adaptions than conventional back propagation (35),
although not as few as the steepest descent algorithm. In fact for the XOR problem
($N^{H} = 52\%\ N^{R} = 208\%\ N^{SD}$), the harmonic algorithm requires about half the weight
updates required for Rumelhart back propagation, although about twice as many as steepest descent.
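The quoted percentages can be reproduced with a short calculation. The sketch below takes M = 4 (the four XOR patterns) and an arbitrary illustrative Q = 3; the ratios do not depend on Q. The function name is hypothetical.

```python
from fractions import Fraction

def adaption_counts(Q, M):
    """Weight adaptions needed for Q*M**2 training cycles under each schedule."""
    rumelhart = Q * M * M                      # one adaption per training cycle
    steepest = Q * M                           # one adaption per M training cycles
    harmonic = Q * M * sum(Fraction(1, u) for u in range(1, M + 1))
    return rumelhart, steepest, harmonic

n_r, n_sd, n_h = adaption_counts(Q=3, M=4)
print(float(n_h / n_r))    # harmonic relative to Rumelhart
print(float(n_h / n_sd))   # harmonic relative to steepest descent
```

For M = 4 the harmonic sum is 25/12, so the harmonic algorithm needs 25/48 of the Rumelhart adaptions and 25/12 times the steepest descent adaptions, in agreement with the 52% and 208% figures.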

3.3  SENSITIVITY  TO  ADDITIONAL  A  PRIORI  INFORMATION 

The  relative  sensitivity  of  the  harmonic  algorithm  with  respect  to  additional  a  priori 
information  is  addressed.  The  additional  information  consists  of  specifying  the  training  duration 
parameter Q.
As noted previously, Q is typically estimated by running the Rumelhart algorithm repeatedly
on the training set to obtain the average number of training sweeps. However, this procedure can
be time consuming. Alternatively, Q may be arbitrarily selected provided the performance remains
satisfactory. The following sensitivity study illustrates the degradation incurred for arbitrary
selections of Q for the benchmark association problem.

The study involves varying the value of Q about the nominal value ($Q = Q_0 = s / M$).
Note the limiting cases ($Q = 0$, $Q = \infty$) represent the previously compared algorithms as
illustrated in Fig. 8.


Fig. 8. Training schedules produced by varying duration parameter Q (limiting cases: steepest
descent, Q = 0; Rumelhart, Q = ∞).


The least-squares linear approximations to the training performance data compiled with the
various Q values for the same 200 independent sessions used in 3.1 are shown in Fig. 9. The
sensitivity appears most pronounced for the excessively large architectures (37% at $\Delta Q = Q_0 - 0$
at 2 hidden units) and almost negligible at the smaller architectures matched to the XOR problem
complexity (3% at $\Delta Q = Q_0 - 0$ at 2 hidden units). Notice the performance of the various Q
cases is upper bounded by steepest descent, Q = 0. Thus for the problem tested, improper a
priori estimation of Q results at worst in the training performance characteristic of the steepest
descent algorithm.

Fig. 9. Training results for variations in training schedules.


4.  CONCLUSION 


In  conclusion,  a  generalized  learning  algorithm  was  introduced  for  the  multilayer  perceptron 
which  reduces  to  a  host  of  special  cases  —  steepest  descent,  Rumelhart  back  propagation,  and  the 
proposed  harmonic  algorithm  —  simply  by  altering  the  scheduling  of  the  patterns  presented  during 
training.  This  general  technique  of  modifying  the  scheduling  of  the  training  patterns  is  applicable 
to  a  variety  of  iterative  minimization  techniques,  possibly  resulting  in  favorable  compromises  as 
demonstrated  here  in  the  associative  memory  context. 

Within  such  context,  the  training  schedule  for  the  proposed  harmonic  algorithm  represents  a 
specific  compromise  where  the  search  for  the  optimal  weight  vector  begins  by  adapting  the  weights 
following  the  presentation  of  each  training  pattern,  while  concluding  by  adapting  the  weights 
following  the  presentation  of  all  training  patterns. 

The results indicate similar training performance amongst the three algorithms compared,
although significant differences arise in the number of weight adaptions required. The steepest
descent algorithm requires the fewest adaptions, followed by the harmonic algorithm (weight
adaption rate asymptotically approaches the favorable steepest descent rate), and finally the
Rumelhart algorithm (weight adaption rate M times greater than steepest descent).

In  all,  the  choice  of  the  learning  algorithm  employed  for  a  given  association  problem  is 
dependent  upon  the  training  performance  comparison  for  such  problem  and  the  priority  given  to  the 
relative  computational  burden. 
APPENDIX


The  method  of  steepest  descent  can  be  intuitively  formulated  with  the  following 
example.  Consider  the  single  dimension  case  where  G(w)  is  to  be  minimized  with  respect 
to  the  independent  variable  w  .  The  function  G  can  be  viewed  as  a  landscape  with  hills 
and valleys, as shown in Fig. A-1.


Fig. A-1. Landscape interpretation of minimization.


With  such  interpretation,  the  goal  of  the  minimization  process  is  to  travel  from  an  initial 
position  upon  a  hillside  to  the  final  destination  in  the  valley  below.  From  the  diagram,  the 
traveler  should  travel  east  (increasing  w )  when  on  the  western  hillside  in  order  to  reach  the 
valley.  Notice  the  slope  of  the  western  hillside  is  negative  while  the  eastern  hillside  is 
positive.  Hence  the  traveler's  plan  can  be  stated  mathematically  as  always  moving  in  the 
direction opposite to the polarity of the derivative at the present location. Thus the
traveler's rule guiding the next step can be formulated in terms of the present step according
to

$$ w(n+1) = w(n) + \Delta w(n) \tag{36} $$

where

$$ \Delta w(n) = -\eta\, \mathrm{sign}\!\left[ \left. \frac{\partial G}{\partial w} \right|_{w(n)} \right] \tag{37} $$

and $\eta$ represents the traveling rate (dependent upon the traveler's physical condition).


A refinement can be made by realizing the traveler should take large steps when
high on the hillside far from the valley (large $|\partial G / \partial w|$), and conversely take small
steps when nearing the valley (small $|\partial G / \partial w|$). Mathematically, this is equivalent to
making the step size proportional to the magnitude of the derivative. Consequently, the
refined rule is (36) with

$$ \Delta w(n) = -\eta \left. \frac{\partial G}{\partial w} \right|_{w(n)} \tag{38} $$

and this minimization process is diagrammed in Fig. A-1.

The final extension involves expanding the landscape in many directions. Here G
is now a function of several independent variables, denoted by the vector $\vec{w}$. The
traveler simply moves in the direction of steepest descent, yielding the following algorithm
applied to each component:

$$ w_i(n+1) = w_i(n) + \Delta w_i(n) \tag{39} $$

where

$$ \Delta w_i(n) = -\eta \left. \frac{\partial G(\vec{w})}{\partial w_i} \right|_{\vec{w}(n)} \tag{40} $$

Disadvantages  of  the  algorithm  include  the  inability  to  escape  from  local  minima  as  well  as 
excessive  computation  times  for  large  dimensions. 
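Both forms of the traveler's rule can be sketched on a simple landscape. The quadratic test functions in the usage note, the step counts and the function names are illustrative assumptions.

```python
def sign_descent(dG, w0, eta=0.1, steps=100):
    """One-dimensional fixed-step rule: w(n+1) = w(n) - eta * sign(dG/dw)."""
    w = w0
    for _ in range(steps):
        slope = dG(w)
        w -= eta * ((slope > 0) - (slope < 0))   # sign of the slope as -1, 0 or +1
    return w

def steepest_descent(grad, w0, eta=0.1, steps=100):
    """Component-wise rule: w_i(n+1) = w_i(n) - eta * dG/dw_i evaluated at w(n)."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w
```

On G(w) = (w − 1)^2 the sign rule reaches the neighborhood of the minimum and then steps back and forth across it by eta, whereas the proportional rule applied to G(w) = (w_1 − 1)^2 + (w_2 + 2)^2 shrinks its steps and settles at (1, −2). This is the behavior that motivates the refinement.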